-
Notifications
You must be signed in to change notification settings - Fork 811
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updates for condor on CMS Connect in singularity #3845
base: mg265UL
Are you sure you want to change the base?
Conversation
@@ -581,6 +585,10 @@ make_gridpack () { | |||
echo "cleaning temporary gridpack" | |||
rm $WORKDIR/pilotrun_gridpack.tar.gz | |||
|
|||
# awightma start: Force the re-weight step to only use 1 core | |||
echo "nb_core = 1" >> $WORKDIR/process/madevent/Cards/me5_configuration.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this does not affect phase space integration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is the step after when EFT reweighting is done. I can remove this too since it's not for general use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is due to a bug in madgraph that doesn't work when nb_core is not 1? If so, it might be worth keeping it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might no longer be the case, but when the reweighting step gets run using a multicore setup it would crash or otherwise incorrectly combine the jobs produced by each core.
From what I remember it seemed like some sort of race condition where the code would crash if the first/initial job didn't finish first and would complete when the first job finished first, but in this case I was skeptical of the output since it seemed like the subsequent jobs would overwrite the output in a way that did not seem correct (e.g. the output file size seemed to bounce around as the jobs completed with no indication that there was some sort of post-job merging going on).
The other reason to limit the number of cores explicitly is b/c otherwise madgraph will default to try and use as many cores as possible on the host machine and these gridpacks were produced using CMSConnect, which meant that the reweighting step would end up saturating every possible core on the submission node for CMSConnect, which would be very bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah... yes madgraph had this bug for the initial phase space integration long time ago (eating up all cores) but looks like for reweighting this behavior is still there in 2.6.5 (not sure about 2.9.x versions).
@bbilin @lviliani maybe consider this as well in master branch? or i am raising this up again which @DickyChant and I were hoping for : it's a lot easier just to keep master as the default for all campaigns instead of spending time to fix sth on both branches and deprecate old ones if the analyses are just new
@@ -694,6 +703,8 @@ else | |||
cmssw_version=CMSSW_10_2_24_patch1 | |||
elif [[ $SYSTEM_RELEASE == *"release 7"* ]]; then | |||
cmssw_version=CMSSW_10_6_19 | |||
elif [[ $SYSTEM_RELEASE == *"release 9"* ]]; then | |||
cmssw_version=CMSSW_13_2_9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
el9 gridpacks won't work for official production yet (that's why we just didn't care about el9 so much in the script), or is the container-in-container prepared? @DickyChant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should officially disable el9 instead of making it work like this (el9 gridpacks are not going to be usable for official production) , talked to sitian
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
if [ -e "$input_files" ]; then rm "$input_files"; fi | ||
tar -zchf "$input_files" "$card_dir" "$patches_directory" "$utilities_dir" "${plugin_directory}" | ||
tar -zchf "$input_files" "$card_dir" "$patches_directory" "$utilities_dir" "${plugin_directory}" "${addons_dir}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not safe since not all people are going to be using a model that lives under addons directory and tar command will eventually crash if it does not exist. why not just steer it with extramodels.dat
?
this will be wgetted from link
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining, I've removed this. Our current model is not in https://cms-project-generators.web.cern.ch/cms-project-generators/. Can we add it or can only people in the gen group make changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should ask GEN conveners
This PR fixes condor on CMS Connect. The
cmssw-cc7-condor-python27
singularity container is required to run, and must be activated from the topgenproductions
directory.